Iteration Three – Formula 1

Kaloyan Rakov

Formula 1 is a motorsport championship featuring 10 teams with 2 drivers each. Each season awards 2 titles: the Drivers' Championship (for individual drivers) and the Constructors' Championship (team-based). In this notebook I will dive into the factors that have an effect on the drivers' performances and create a machine-learning model that will predict them.

In [1]:
from IPython.display import Image
Image(filename='Mercedes.jpg')
Out[1]:
(image: Mercedes.jpg)

Data provisioning¶

What data are we working with:

For this project I am working with multiple datasets. One of them covers the previous season, as it gives us important context about the drivers and their past performances. "Formula1_2024season_raceResults.csv" gives us access to every result every driver produced throughout all of the races on the 2024 calendar. The other datasets are collected from each individual race and are based on the drivers' qualifying results for that race. It's important to look at each race individually, since performances vary considerably from track to track; otherwise we would be oversimplifying the project. The qualifying data is saved as QUALIFYING.csv. I am also using the drivers' starting positions (STARTING_GRID.csv) and the final results (RESULTS.csv) so we can compare the predictions to the real results.

The data from the 2024 season was available here: https://github.com/toUpperCase78/formula1-datasets/blob/master/Formula1_2024season_raceResults.csv

The RESULTS.csv, STARTING_GRID.csv and QUALIFYING.csv files were scraped from the official Formula 1 website. The scraping was done here: Data Scraping. For a detailed description of the datasets I have provided a Data Dictionary here: Data Dictionary

Loading the data:

In [2]:
import pandas as pd
df = pd.read_csv('Formula1_2024season_raceResults.csv')

Sampling the data:¶

Let's get a better idea of what the dataset is by printing the first 5 lines...

In [3]:
df.head(5)
Out[3]:
Track Position No Driver Team Starting Grid Laps Time/Retired Points Set Fastest Lap Fastest Lap Time
0 Bahrain 1 1 Max Verstappen Red Bull Racing Honda RBPT 1 57 1:31:44.742 26 Yes 1:32.608
1 Bahrain 2 11 Sergio Perez Red Bull Racing Honda RBPT 5 57 +22.457 18 No 1:34.364
2 Bahrain 3 55 Carlos Sainz Ferrari 4 57 +25.110 15 No 1:34.507
3 Bahrain 4 16 Charles Leclerc Ferrari 2 57 +39.669 12 No 1:34.090
4 Bahrain 5 63 George Russell Mercedes 3 57 +46.788 10 No 1:35.065

... and the last 5 lines

In [4]:
df.tail(5)
Out[4]:
Track Position No Driver Team Starting Grid Laps Time/Retired Points Set Fastest Lap Fastest Lap Time
474 Abu Dhabi 16 20 Kevin Magnussen Haas Ferrari 14 57 +1 lap 0 Yes 1:25.637
475 Abu Dhabi 17 30 Liam Lawson RB Honda RBPT 12 55 DNF 0 No 1:28.751
476 Abu Dhabi NC 77 Valtteri Bottas Kick Sauber Ferrari 9 30 DNF 0 No 1:29.482
477 Abu Dhabi NC 43 Franco Colapinto Williams Mercedes 20 26 DNF 0 No 1:29.411
478 Abu Dhabi NC 11 Sergio Perez Red Bull Racing Honda RBPT 10 0 DNF 0 No NaN

The race ends when the leading driver completes the required number of laps, after which everyone behind him finishes the lap they are on. That's why every driver after the winner is recorded as a gap, "+SS.sss", while the winner is recorded as a full race time, "H:MM:SS.sss". In order to use the data, I will need to process all the times into the same format. Some drivers, however, get lapped by the winner (the winner manages to complete a full extra lap relative to them), which means that when the race finishes they haven't completed the full number of laps; they are recorded as "+1 Lap", "+2 Laps", and so on. There are also drivers listed as NC (Not Classified) in their final position. This happens when a driver starts the race but doesn't complete at least 90% of it (often caused by accidents or mechanical failures), which in the results is listed as DNF (Did Not Finish), DSQ (Disqualified) or, rarely, DNS (Did Not Start).
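These rules can be summarised in a small sketch (a hypothetical `classify_result` helper for illustration, not part of the notebook's pipeline):

```python
def classify_result(time_str: str) -> str:
    """Classify a raw Time/Retired entry from the results table."""
    s = str(time_str).strip()
    if s in {"DNF", "DSQ", "DNS"}:
        return "not_classified"      # driver did not complete >= 90% of the race
    if s.startswith("+"):
        # "+1 lap" / "+2 laps" means lapped; "+22.457" is a gap to the winner
        return "lapped" if "lap" in s.lower() else "gap_to_winner"
    if ":" in s:
        return "winner_time"         # full race time, "H:MM:SS.sss"
    return "unknown"

print(classify_result("+22.457"))    # gap_to_winner
print(classify_result("+1 lap"))     # lapped
```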

Preprocessing¶

Let's make sure that every driver who completed the required number of laps has their time recorded in the same format: "H:MM:SS.sss" (here we will call that "Absolute Time"). For drivers who still had laps to complete when the race finished, we will record PF (Premature Finish) plus the number of laps they were short. If a driver doesn't finish, they will be classified as DNF (regardless of the reason: DNF, DSQ or DNS).

In [5]:
absolute_times = []

for _, row in df.iterrows():
    time_str = str(row["Time/Retired"]).strip()

    if time_str.startswith("+"):
        # "+1 lap" / "+2 laps": the driver was lapped -> Premature Finish
        if time_str[1:].split()[0].isdigit():
            absolute_times.append(f"PF {time_str}")
            continue

        # "+SS.sss": a gap to the winner -> add it to the winner's race time
        winner_mask = (df["Track"] == row["Track"]) & (df["Position"].astype(str).str.strip() == "1")

        if winner_mask.any():
            winner_time_str = str(df.loc[winner_mask, "Time/Retired"].values[0])

            if ":" in winner_time_str:
                winner_parts = list(map(float, winner_time_str.split(":")))
                # "H:MM:SS.sss" has three parts, "MM:SS.sss" has two
                winner_total = (winner_parts[0] * 3600 + winner_parts[1] * 60 + winner_parts[2]
                                if len(winner_parts) == 3
                                else winner_parts[0] * 60 + winner_parts[1])

                try:
                    delta = float(time_str[1:])
                    absolute_total = winner_total + delta
                    hours = int(absolute_total // 3600)
                    minutes = int((absolute_total % 3600) // 60)
                    seconds = absolute_total % 60
                    absolute_times.append(f"{hours}:{minutes:02d}:{seconds:06.3f}")
                except ValueError:
                    absolute_times.append("Time calculation error")
            else:
                absolute_times.append("Invalid winner time")
        else:
            absolute_times.append("No winner found")
    else:
        # Winner times ("H:MM:SS.sss") and retirements (DNF/DSQ/DNS) pass through
        absolute_times.append(time_str)

df["Absolute Time"] = absolute_times

Let's take a look at how the data looks now. There is a new column, "Absolute Time", where all the recorded times follow the same format.

In [6]:
df.head(10)
Out[6]:
Track Position No Driver Team Starting Grid Laps Time/Retired Points Set Fastest Lap Fastest Lap Time Absolute Time
0 Bahrain 1 1 Max Verstappen Red Bull Racing Honda RBPT 1 57 1:31:44.742 26 Yes 1:32.608 1:31:44.742
1 Bahrain 2 11 Sergio Perez Red Bull Racing Honda RBPT 5 57 +22.457 18 No 1:34.364 1:32:07.199
2 Bahrain 3 55 Carlos Sainz Ferrari 4 57 +25.110 15 No 1:34.507 1:32:09.852
3 Bahrain 4 16 Charles Leclerc Ferrari 2 57 +39.669 12 No 1:34.090 1:32:24.411
4 Bahrain 5 63 George Russell Mercedes 3 57 +46.788 10 No 1:35.065 1:32:31.530
5 Bahrain 6 4 Lando Norris McLaren Mercedes 7 57 +48.458 8 No 1:34.476 1:32:33.200
6 Bahrain 7 44 Lewis Hamilton Mercedes 9 57 +50.324 6 No 1:34.722 1:32:35.066
7 Bahrain 8 81 Oscar Piastri McLaren Mercedes 8 57 +56.082 4 No 1:34.983 1:32:40.824
8 Bahrain 9 14 Fernando Alonso Aston Martin Aramco Mercedes 6 57 +74.887 2 No 1:34.199 1:32:59.629
9 Bahrain 10 18 Lance Stroll Aston Martin Aramco Mercedes 12 57 +93.216 1 No 1:35.632 1:33:17.958
In [7]:
df.tail(10)
Out[7]:
Track Position No Driver Team Starting Grid Laps Time/Retired Points Set Fastest Lap Fastest Lap Time Absolute Time
469 Abu Dhabi 11 23 Alexander Albon Williams Mercedes 18 57 +1 lap 0 No 1:29.438 PF +1 lap
470 Abu Dhabi 12 22 Yuki Tsunoda RB Honda RBPT 11 57 +1 lap 0 No 1:29.200 PF +1 lap
471 Abu Dhabi 13 24 Guanyu Zhou Kick Sauber Ferrari 15 57 +1 lap 0 No 1:27.982 PF +1 lap
472 Abu Dhabi 14 18 Lance Stroll Aston Martin Aramco Mercedes 13 57 +1 lap 0 No 1:28.604 PF +1 lap
473 Abu Dhabi 15 61 Jack Doohan Alpine Renault 17 57 +1 lap 0 No 1:29.121 PF +1 lap
474 Abu Dhabi 16 20 Kevin Magnussen Haas Ferrari 14 57 +1 lap 0 Yes 1:25.637 PF +1 lap
475 Abu Dhabi 17 30 Liam Lawson RB Honda RBPT 12 55 DNF 0 No 1:28.751 DNF
476 Abu Dhabi NC 77 Valtteri Bottas Kick Sauber Ferrari 9 30 DNF 0 No 1:29.482 DNF
477 Abu Dhabi NC 43 Franco Colapinto Williams Mercedes 20 26 DNF 0 No 1:29.411 DNF
478 Abu Dhabi NC 11 Sergio Perez Red Bull Racing Honda RBPT 10 0 DNF 0 No NaN DNF

When the time isn't available, under "Absolute Time" we have either DNF or the number of laps the driver was short (PF).

Formula 1 is a pretty expensive sport that relies heavily on sponsorships, which change quite often, and subsequently so do the teams' names. There are two types of teams: works teams (aka factory teams) and customer teams. The works teams produce their own cars and engines (Ferrari, Mercedes, etc.), while the customer teams have to buy their engines (or other parts) from the works teams. If a team like Haas uses an engine supplied to them by Ferrari, they need to carry the Ferrari name in their entry. That's how we end up with teams like Aston Martin Aramco Mercedes and Kick Sauber Ferrari. In order to avoid confusion I will address the teams by their shortened, simplest names. Some teams have renamed themselves for other sponsorship reasons, so I will use their 2025 names.

In [8]:
team_name_mapping = {
    'McLaren Mercedes': 'McLaren',
    'Mercedes': 'Mercedes',
    'Red Bull Racing Honda RBPT': 'Red Bull Racing',
    'Ferrari': 'Ferrari',
    'RB Honda RBPT': 'Racing Bulls',
    'Williams Mercedes': 'Williams',
    'Haas Ferrari': 'Haas',
    'Kick Sauber Ferrari': 'Kick Sauber',
    'Aston Martin Aramco Mercedes': 'Aston Martin',
    'Alpine Renault': 'Alpine'
}
df['Team'] = df['Team'].replace(team_name_mapping)

Now we can see that our dataset is using the current team names:

In [9]:
print(df['Team'].unique())
['Red Bull Racing' 'Ferrari' 'Mercedes' 'McLaren' 'Aston Martin'
 'Kick Sauber' 'Haas' 'Racing Bulls' 'Williams' 'Alpine']

Cleaning¶

We need to make sure that there are no empty rows or invalid data. Here is what the data looks like before cleaning:

In [10]:
df.head(15)
Out[10]:
Track Position No Driver Team Starting Grid Laps Time/Retired Points Set Fastest Lap Fastest Lap Time Absolute Time
0 Bahrain 1 1 Max Verstappen Red Bull Racing 1 57 1:31:44.742 26 Yes 1:32.608 1:31:44.742
1 Bahrain 2 11 Sergio Perez Red Bull Racing 5 57 +22.457 18 No 1:34.364 1:32:07.199
2 Bahrain 3 55 Carlos Sainz Ferrari 4 57 +25.110 15 No 1:34.507 1:32:09.852
3 Bahrain 4 16 Charles Leclerc Ferrari 2 57 +39.669 12 No 1:34.090 1:32:24.411
4 Bahrain 5 63 George Russell Mercedes 3 57 +46.788 10 No 1:35.065 1:32:31.530
5 Bahrain 6 4 Lando Norris McLaren 7 57 +48.458 8 No 1:34.476 1:32:33.200
6 Bahrain 7 44 Lewis Hamilton Mercedes 9 57 +50.324 6 No 1:34.722 1:32:35.066
7 Bahrain 8 81 Oscar Piastri McLaren 8 57 +56.082 4 No 1:34.983 1:32:40.824
8 Bahrain 9 14 Fernando Alonso Aston Martin 6 57 +74.887 2 No 1:34.199 1:32:59.629
9 Bahrain 10 18 Lance Stroll Aston Martin 12 57 +93.216 1 No 1:35.632 1:33:17.958
10 Bahrain 11 24 Guanyu Zhou Kick Sauber 17 56 +1 lap 0 No 1:35.458 PF +1 lap
11 Bahrain 12 20 Kevin Magnussen Haas 15 56 +1 lap 0 No 1:35.570 PF +1 lap
12 Bahrain 13 3 Daniel Ricciardo Racing Bulls 14 56 +1 lap 0 No 1:35.163 PF +1 lap
13 Bahrain 14 22 Yuki Tsunoda Racing Bulls 11 56 +1 lap 0 No 1:35.833 PF +1 lap
14 Bahrain 15 23 Alexander Albon Williams 13 56 +1 lap 0 No 1:35.723 PF +1 lap

Let's check if there are any rows with empty data:

In [11]:
df.isnull().sum()
Out[11]:
Track                0
Position             0
No                   0
Driver               0
Team                 0
Starting Grid        0
Laps                 0
Time/Retired         0
Points               0
Set Fastest Lap      0
Fastest Lap Time    16
Absolute Time        0
dtype: int64

We can see that the only column that has any empty data is "Fastest Lap Time". Let's take a closer look:

In [12]:
missing_fastest_lap = df[df["Fastest Lap Time"].isnull()]
display(missing_fastest_lap)
Track Position No Driver Team Starting Grid Laps Time/Retired Points Set Fastest Lap Fastest Lap Time Absolute Time
39 Saudi Arabia NC 10 Pierre Gasly Alpine 18 1 DNF 0 No NaN DNF
77 Japan NC 3 Daniel Ricciardo Racing Bulls 11 0 DNF 0 No NaN DNF
78 Japan NC 23 Alexander Albon Williams 14 0 DNF 0 No NaN DNF
155 Monaco NC 31 Esteban Ocon Alpine 11 0 DNF 0 No NaN DNF
156 Monaco NC 11 Sergio Perez Red Bull Racing 16 0 DNF 0 No NaN DNF
157 Monaco NC 27 Nico Hulkenberg Haas 19 0 DNF 0 No NaN DNF
158 Monaco NC 20 Kevin Magnussen Haas 20 0 DNF 0 No NaN DNF
238 Great Britain NC 10 Pierre Gasly Alpine 19 0 DNS 0 No NaN DNS
378 United States NC 44 Lewis Hamilton Mercedes 17 1 DNF 0 No NaN DNF
397 Mexico NC 23 Alexander Albon Williams 9 0 DNF 0 No NaN DNF
398 Mexico NC 22 Yuki Tsunoda Racing Bulls 11 0 DNF 0 No NaN DNF
416 Brazil NC 23 Alexander Albon Williams 7 0 DNS 0 No NaN DNS
417 Brazil NC 18 Lance Stroll Aston Martin 10 0 DNS 0 No NaN DNS
457 Qatar NC 43 Franco Colapinto Williams 19 0 DNF 0 No NaN DNF
458 Qatar NC 31 Esteban Ocon Alpine 20 0 DNF 0 No NaN DNF
478 Abu Dhabi NC 11 Sergio Perez Red Bull Racing 10 0 DNF 0 No NaN DNF

We can see that all of these rows have NC as their position and were all retired as DNF or DNS. They are the result of drivers crashing, failing to start or some other type of emergency, which explains why they don't have a "Fastest Lap Time". We could either delete these rows or update the value from "NaN" to "No time". I think the latter is the better option, since I want to keep as much data as possible for the analysis later on.

In [13]:
df["Fastest Lap Time"] = df["Fastest Lap Time"].fillna("No time")
missing_fastest_lap = df[df["Fastest Lap Time"] == "No time"]
display(missing_fastest_lap)
Track Position No Driver Team Starting Grid Laps Time/Retired Points Set Fastest Lap Fastest Lap Time Absolute Time
39 Saudi Arabia NC 10 Pierre Gasly Alpine 18 1 DNF 0 No No time DNF
77 Japan NC 3 Daniel Ricciardo Racing Bulls 11 0 DNF 0 No No time DNF
78 Japan NC 23 Alexander Albon Williams 14 0 DNF 0 No No time DNF
155 Monaco NC 31 Esteban Ocon Alpine 11 0 DNF 0 No No time DNF
156 Monaco NC 11 Sergio Perez Red Bull Racing 16 0 DNF 0 No No time DNF
157 Monaco NC 27 Nico Hulkenberg Haas 19 0 DNF 0 No No time DNF
158 Monaco NC 20 Kevin Magnussen Haas 20 0 DNF 0 No No time DNF
238 Great Britain NC 10 Pierre Gasly Alpine 19 0 DNS 0 No No time DNS
378 United States NC 44 Lewis Hamilton Mercedes 17 1 DNF 0 No No time DNF
397 Mexico NC 23 Alexander Albon Williams 9 0 DNF 0 No No time DNF
398 Mexico NC 22 Yuki Tsunoda Racing Bulls 11 0 DNF 0 No No time DNF
416 Brazil NC 23 Alexander Albon Williams 7 0 DNS 0 No No time DNS
417 Brazil NC 18 Lance Stroll Aston Martin 10 0 DNS 0 No No time DNS
457 Qatar NC 43 Franco Colapinto Williams 19 0 DNF 0 No No time DNF
458 Qatar NC 31 Esteban Ocon Alpine 20 0 DNF 0 No No time DNF
478 Abu Dhabi NC 11 Sergio Perez Red Bull Racing 10 0 DNF 0 No No time DNF

Now let's check if there are any null values left in our data:

In [14]:
df.isnull().sum()
Out[14]:
Track               0
Position            0
No                  0
Driver              0
Team                0
Starting Grid       0
Laps                0
Time/Retired        0
Points              0
Set Fastest Lap     0
Fastest Lap Time    0
Absolute Time       0
dtype: int64

As we can see, there are no more null values in our dataset. It's clean, which means that now we can dive a little bit deeper into what the data actually represents. A good way of doing that is through visualisation.

Visualisation¶

The first thing we might want to look into is the number of races each driver has won:

In [15]:
import pandas as pd
import matplotlib.pyplot as plt

all_drivers = df['Driver'].unique()

wins = df[df['Position'] == '1']
win_counts = wins['Driver'].value_counts()

full_win_counts = pd.Series(0, index=all_drivers)
full_win_counts.update(win_counts)

full_win_counts = full_win_counts.sort_values(ascending=False)

plt.figure(figsize=(12, 8))
bars = plt.bar(full_win_counts.index, full_win_counts.values, color='gold', edgecolor='black')

plt.title('Wins per Driver', fontsize=16)
plt.xlabel('Driver', fontsize=12)
plt.ylabel('Number of Wins', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)

for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{int(height)}',
             ha='center', va='bottom')

plt.tight_layout()
plt.show()
(figure: Wins per Driver)

From the graph above, one might conclude that Max Verstappen exercises considerable dominance over the other drivers. Is this true? To get a more detailed perspective on how the drivers differ from one another, it is a good idea to look at their average fastest laps (the average of their best lap on each track), since looking only at the wins may be deceptive.

In [16]:
df_laps = df.copy()
df_laps = df_laps[df_laps['Fastest Lap Time'] != 'No time']

def lap_time_to_seconds(x):
    minutes, seconds = x.split(':')
    return int(minutes) * 60 + float(seconds)

df_laps['Fastest Lap Time (s)'] = df_laps['Fastest Lap Time'].apply(lap_time_to_seconds)
avg_fastest_lap_driver = df_laps.groupby('Driver')['Fastest Lap Time (s)'].mean().sort_values()
def seconds_to_lap_time(seconds):
    minutes = int(seconds // 60)
    secs = seconds % 60
    return f"{minutes}:{secs:06.3f}"

plt.figure(figsize=(12,10))
bars = avg_fastest_lap_driver.plot(kind='barh', color='orange')
plt.xlabel('Average Fastest Lap Time (seconds)')
plt.title('Average Fastest Lap Time per Driver')
plt.grid(axis='x')
plt.gca().invert_yaxis()

for index, value in enumerate(avg_fastest_lap_driver):
    lap_time_formatted = seconds_to_lap_time(value)
    plt.text(value + 0.5, index, lap_time_formatted, va='center')
plt.tight_layout()
plt.show()
(figure: Average Fastest Lap Time per Driver)

As we can see, the difference from driver to driver is often a fraction of a second, meaning the competition is far fiercer than the number of wins per driver may suggest. With the same data we can also show the number of wins per team:

In [17]:
all_teams = df['Team'].unique()
team_wins = df[df['Position'] == '1']
team_win_counts = team_wins['Team'].value_counts()
full_team_win_counts = pd.Series(0, index=all_teams)
full_team_win_counts.update(team_win_counts)

full_team_win_counts = full_team_win_counts.sort_values(ascending=False)

plt.figure(figsize=(12, 8))
bars = plt.bar(full_team_win_counts.index, full_team_win_counts.values, 
               color='#ff0d0d', edgecolor='black')

plt.title('F1 Wins by Team', fontsize=16, pad=20)
plt.xlabel('Team', fontsize=12)
plt.ylabel('Number of Wins', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.3)

for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{int(height)}',
             ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()
(figure: F1 Wins by Team)

"Red Bull Racing" is indeed Max Verstappen's team, and since he has the most wins individually, he brings his team up in the Constructors Championship as well. We can see that McLaren, Ferrari and Mercedes are going to be the main favourites to take the throne from Red Bull. Let's analyse the tracks themselves. It's interesting to see which tracks are the most problematic for the drivers. Which ones result it the most amounts of NCs?

In [18]:
nc_df = df[df['Position'] == 'NC']
nc_counts = nc_df['Track'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(10,6))
nc_counts.plot(kind='bar', color='red')
plt.title('Number of NCs per Track')
plt.xlabel('Track')
plt.ylabel('Number of NCs')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y')
plt.tight_layout()
plt.show()
(figure: Number of NCs per Track)

Here we can see that Canada and Qatar are the most problematic for drivers. As we saw before, some drivers DO finish their races, but since they get lapped, they don't complete the full number of laps. Let's look into who these drivers are:

In [19]:
all_drivers = df['Driver'].unique()
pf_laps = df[df['Absolute Time'].str.contains('PF', na=False)].copy()
pf_laps['PF Status'] = pf_laps['Absolute Time'].str.extract(r'(PF \+\d+ laps?)')
pf_counts = pf_laps.groupby(['Driver', 'PF Status']).size().unstack(fill_value=0)
pf_counts = pf_counts.reindex(all_drivers, fill_value=0)
pf_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='coolwarm')
plt.title("Number of Times Each Driver Was Lapped")
plt.ylabel("Count")
plt.xlabel("Driver")
plt.xticks(rotation=45, ha='right')
plt.legend(title="Lapping Status")
plt.tight_layout()
plt.show()
(figure: Number of Times Each Driver Was Lapped)

From the graph above we can conclude that most drivers get lapped once, more rarely twice. An outlier is Lando Norris, who got lapped 7 times. Let's take a look at the tracks themselves. It's interesting to see which ones require more time per lap. We can do that by taking the average of each driver's fastest lap for every track:

In [20]:
import pandas as pd
import matplotlib.pyplot as plt

df_laps = df.copy()
df_laps = df_laps[df_laps['Fastest Lap Time'] != 'No time']

def lap_time_to_seconds(x):
    minutes, seconds = x.split(':')
    return int(minutes) * 60 + float(seconds)

df_laps['Fastest Lap Time (s)'] = df_laps['Fastest Lap Time'].apply(lap_time_to_seconds)

avg_fastest_lap = df_laps.groupby('Track')['Fastest Lap Time (s)'].mean().sort_values()

def seconds_to_lap_time(seconds):
    minutes = int(seconds // 60)
    secs = seconds % 60
    return f"{minutes}:{secs:06.3f}"

plt.figure(figsize=(12,7))
bars = avg_fastest_lap.plot(kind='barh', color='skyblue')
plt.xlabel('Average Fastest Lap Time (seconds)')
plt.title('Average Fastest Lap Time per Race Track')
plt.grid(axis='x')

plt.gca().invert_yaxis()

for index, value in enumerate(avg_fastest_lap):
    lap_time_formatted = seconds_to_lap_time(value)
    plt.text(value + 0.5, index, lap_time_formatted, va='center')

plt.tight_layout()
plt.show()
(figure: Average Fastest Lap Time per Race Track)

Now that we have taken a closer look at the tracks, lets see how each driver performs on each track:

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

df_race = df.copy()

df_race['Position'] = pd.to_numeric(df_race['Position'], errors='coerce')

NC_VALUE = 25
df_race['Position_Filled'] = df_race['Position'].fillna(NC_VALUE)

heatmap_data = df_race.pivot_table(index='Driver', columns='Track', values='Position_Filled')
annot_data = heatmap_data.applymap(
    lambda x: "NC" if x == NC_VALUE else (f"{int(x)}" if not pd.isna(x) else "")
)

# Plot heatmap
plt.figure(figsize=(16, 10))
sns.heatmap(
    heatmap_data,
    annot=annot_data,
    fmt="",
    cmap='YlGnBu_r',
    linewidths=0.5,
    linecolor='gray',
    cbar_kws={'label': 'Finishing Position'},
    vmin=1,
    vmax=20
)

plt.title('Finishing Positions of Drivers by Race')
plt.xlabel('Track')
plt.ylabel('Driver')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
(figure: Finishing Positions of Drivers by Race heatmap)

Looking at the graph above, the more popular drivers tend to have a row that is almost exclusively blue (blue indicates a higher finish). The drivers with consistently blue rows tend to be members of the "contender" teams from the previous graph: Mercedes, McLaren, Ferrari, Red Bull. Something important to look into is how positions change between start and finish, and what the trend behind that change is.

In [22]:
import seaborn as sns

df_grid = df.copy()

df_grid['Grid'] = pd.to_numeric(df_grid['Starting Grid'], errors='coerce')
df_grid['Position'] = pd.to_numeric(df_grid['Position'], errors='coerce')

df_grid = df_grid.dropna(subset=['Grid', 'Position'])

plt.figure(figsize=(10, 8))
sns.scatterplot(data=df_grid, x='Grid', y='Position', hue='Driver', alpha=0.7, legend=False)

# Diagonal reference line (Grid == Position)
plt.plot([1, 20], [1, 20], 'r--', label='Same Start/Finish')

# Regression line
sns.regplot(data=df_grid, x='Grid', y='Position', scatter=False, color='blue', label='Trend Line')

plt.xlim(1, 20)
plt.ylim(1, 20)
plt.xticks(range(1, 21))
plt.yticks(range(1, 21))

plt.xlabel('Starting Grid Position')
plt.ylabel('Finishing Position')
plt.title('Starting Grid Position vs Finishing Position')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
(figure: Starting Grid Position vs Finishing Position)

The red dotted line represents drivers who started and ended the race in the same position. The blue line represents the general trend in the dataset for the relationship between starting grid and finishing position. We can see that the two lines cross at roughly position 8. We can draw the conclusion that drivers starting in positions 1-7 tend to finish lower in the rankings than they started, while drivers starting 9-20 normally have a better chance of improving their positions.
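The crossover effect can be illustrated with a toy calculation (made-up numbers, not the notebook's data): the mean position change is positive for front starters, who can mostly only lose places, and negative for back-of-the-grid starters, who can mostly only gain them.

```python
import pandas as pd

# Illustrative grid/finish pairs, invented for this sketch.
toy = pd.DataFrame({
    "Grid":     [1, 2, 3, 10, 15, 20],
    "Position": [2, 1, 5,  8, 12, 16],
})
toy["Delta"] = toy["Position"] - toy["Grid"]   # positive = places lost

front = toy[toy["Grid"] <= 7]["Delta"].mean()  # starters 1-7
back = toy[toy["Grid"] >= 9]["Delta"].mean()   # starters 9-20
print(front, back)                             # front > 0, back < 0
```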

Processing¶

Before we move on to selecting the features for the model, we need to do some more processing. For "Absolute Time" and "Fastest Lap Time" I will be working in seconds.

In [23]:
def convert_lap_time_to_seconds(lap_time_str):
    """Convert a "M:SS.sss" lap time to seconds; return None for "No time" etc."""
    if isinstance(lap_time_str, str) and ':' in lap_time_str:
        minutes, seconds = lap_time_str.split(':')
        return int(minutes) * 60 + float(seconds)
    return None

df['Fastest Lap Time (s)'] = df['Fastest Lap Time'].apply(convert_lap_time_to_seconds)

def convert_absolute_time_to_seconds(time_str):
    """Convert an "H:MM:SS.sss" absolute time to seconds; return None otherwise."""
    if isinstance(time_str, str) and ':' in time_str:
        try:
            h, m, s = time_str.split(':')
            return int(h) * 3600 + int(m) * 60 + float(s)
        except ValueError:
            return None
    return None

df['Set Fastest Lap (binary)'] = df['Set Fastest Lap'].map({'Yes': 1, 'No': 0})

numeric_cols = ['Starting Grid', 'Laps', 'Fastest Lap Time (s)', 'Set Fastest Lap (binary)', 'Position']

for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

df_cleaned = df[numeric_cols].dropna()

# Imports used by the plots in the cells below
import seaborn as sns
import matplotlib.pyplot as plt

Qualification Dataset¶

For optimal detail in my prediction results, I will use the qualifying session of each race to make the predictions for that race. The qualifying information is publicly available on the official Formula 1 site, from which I scraped the data for my dataset. Let's take a look at the data:

In [24]:
qualifying_df = pd.read_csv("QUALIFYING.csv")
qualifying_df.head(20)
Out[24]:
Location Pos No Driver Car Q1 Q2 Q3 Laps Team
0 Australia 1 4 Lando Norris McLaren Mercedes 1:15.912 1:15.415 1:15.096 20 McLaren
1 Australia 2 81 Oscar Piastri McLaren Mercedes 1:16.062 1:15.468 1:15.180 18 McLaren
2 Australia 3 1 Max Verstappen Red Bull Racing Honda RBPT 1:16.018 1:15.565 1:15.481 17 Red Bull Racing
3 Australia 4 63 George Russell Mercedes 1:15.971 1:15.798 1:15.546 21 Mercedes
4 Australia 5 22 Yuki Tsunoda Racing Bulls Honda RBPT 1:16.225 1:16.009 1:15.670 18 Racing Bulls
5 Australia 6 23 Alexander Albon Williams Mercedes 1:16.245 1:16.017 1:15.737 21 Williams
6 Australia 7 16 Charles Leclerc Ferrari 1:16.029 1:15.827 1:15.755 20 Ferrari
7 Australia 8 44 Lewis Hamilton Ferrari 1:16.213 1:15.919 1:15.973 23 Ferrari
8 Australia 9 10 Pierre Gasly Alpine Renault 1:16.328 1:16.112 1:15.980 21 Alpine
9 Australia 10 55 Carlos Sainz Jr Williams Mercedes 1:16.360 1:15.931 1:16.062 21 Williams
10 Australia 11 6 Isack Hadjar Racing Bulls Honda RBPT 1:16.354 1:16.175 NaN 12 Racing Bulls
11 Australia 12 14 Fernando Alonso Aston Martin Aramco Mercedes 1:16.288 1:16.453 NaN 13 Aston Martin
12 Australia 13 18 Lance Stroll Aston Martin Aramco Mercedes 1:16.369 1:16.483 NaN 15 Aston Martin
13 Australia 14 7 Jack Doohan Alpine Renault 1:16.315 1:16.863 NaN 15 Alpine
14 Australia 15 5 Gabriel Bortoleto Kick Sauber Ferrari 1:16.516 1:17.520 NaN 13 Kick Sauber
15 Australia 16 12 Andrea Kimi Antonelli Mercedes 1:16.525 NaN NaN 9 Mercedes
16 Australia 17 27 Nico Hulkenberg Kick Sauber Ferrari 1:16.579 NaN NaN 9 Kick Sauber
17 Australia 18 30 Liam Lawson Red Bull Racing Honda RBPT 1:17.094 NaN NaN 7 Red Bull Racing
18 Australia 19 31 Esteban Ocon Haas Ferrari 1:17.147 NaN NaN 9 Haas
19 Australia NC 87 Oliver Bearman Haas Ferrari DNS NaN NaN 1 Haas

Something we can notice is that not all drivers have a time registered for Q2 or Q3, and the number of laps is uneven. Here is the reason: the qualification process is divided into three rounds, Q1, Q2 and Q3. All drivers take part in Q1, but not all of them get to participate in the next round: the slowest 5 are eliminated. The same happens going from Q2 into Q3, with the slowest 5 again eliminated. Q1 lasts 18 minutes and has all 20 drivers, Q2 lasts 15 minutes and has 15 drivers, and Q3 lasts 12 minutes and has the fastest 10 drivers. The drivers are judged on the fastest time they can produce, which is recorded for them in the respective qualifying session.
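The knockout format above can be sketched as a tiny simulation (illustrative times and a hypothetical `run_session` helper):

```python
# Each session drops the five slowest drivers; times here are invented.
def run_session(times: dict, drop: int = 5) -> dict:
    """Return the drivers who advance: all but the `drop` slowest."""
    ranked = sorted(times, key=times.get)        # fastest first
    return {d: times[d] for d in ranked[:-drop]}

q1 = {f"driver_{i}": 90.0 + i * 0.1 for i in range(20)}  # 20 cars start Q1
q2 = run_session(q1)   # 15 advance to Q2
q3 = run_session(q2)   # 10 fight for pole in Q3
print(len(q2), len(q3))  # 15 10
```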

The model we will train below uses the following features: 'Starting Grid', 'Laps', 'Fastest Lap Time (s)' and 'Set Fastest Lap (binary)'. We can get the starting grid from the qualifying results. The number of laps in each race is publicly available information, so we can "hard code" it per race. The fastest lap time, and whether or not a driver set the fastest lap, are also information we can get from the qualifying results. The main idea is to feed the same features to the model so that we can make the predictions. Before that, though, we should do a bit of processing.

In [25]:
def time_to_seconds(time_str):
    """Convert "M:SS.sss" to seconds; anything unparsable stays "No Time"."""
    try:
        minutes, rest = time_str.split(':')
        return int(minutes) * 60 + float(rest)
    except (AttributeError, ValueError):
        return "No Time"


for session in ['Q1', 'Q2', 'Q3']:
    qualifying_df[session] = qualifying_df[session].fillna("No Time")
    qualifying_df[session] = qualifying_df[session].apply(time_to_seconds)

qualifying_df.tail(20)
Out[25]:
Location Pos No Driver Car Q1 Q2 Q3 Laps Team
100 Miami 1 1 Max Verstappen Red Bull Racing Honda RBPT 86.87 86.643 86.204 18 Red Bull Racing
101 Miami 2 4 Lando Norris McLaren Mercedes 86.955 86.499 86.269 21 McLaren
102 Miami 3 12 Andrea Kimi Antonelli Mercedes 87.077 86.606 86.271 20 Mercedes
103 Miami 4 81 Oscar Piastri McLaren Mercedes 87.006 86.269 86.375 16 McLaren
104 Miami 5 63 George Russell Mercedes 87.014 86.575 86.385 20 Mercedes
105 Miami 6 55 Carlos Sainz Jr Williams Mercedes 87.098 86.847 86.569 20 Williams
106 Miami 7 23 Alexander Albon Williams Mercedes 87.042 86.855 86.682 20 Williams
107 Miami 8 16 Charles Leclerc Ferrari 87.417 86.948 86.754 20 Ferrari
108 Miami 9 31 Esteban Ocon Haas Ferrari 87.45 86.967 86.824 21 Haas
109 Miami 10 22 Yuki Tsunoda Red Bull Racing Honda RBPT 87.298 86.959 86.943 21 Red Bull Racing
110 Miami 11 6 Isack Hadjar Racing Bulls Honda RBPT 87.301 86.987 No Time 13 Racing Bulls
111 Miami 12 44 Lewis Hamilton Ferrari 87.279 87.006 No Time 15 Ferrari
112 Miami 13 5 Gabriel Bortoleto Kick Sauber Ferrari 87.343 87.151 No Time 15 Kick Sauber
113 Miami 14 7 Jack Doohan Alpine Renault 87.422 87.186 No Time 15 Alpine
114 Miami 15 30 Liam Lawson Racing Bulls Honda RBPT 87.444 87.363 No Time 14 Racing Bulls
115 Miami 16 27 Nico Hulkenberg Kick Sauber Ferrari 87.473 No Time No Time 9 Kick Sauber
116 Miami 17 14 Fernando Alonso Aston Martin Aramco Mercedes 87.604 No Time No Time 9 Aston Martin
117 Miami 18 10 Pierre Gasly Alpine Renault 87.71 No Time No Time 9 Alpine
118 Miami 19 18 Lance Stroll Aston Martin Aramco Mercedes 87.83 No Time No Time 9 Aston Martin
119 Miami 20 87 Oliver Bearman Haas Ferrari 87.999 No Time No Time 9 Haas

Feature Selection¶

Here are the features that I will be using in my model: Starting Grid, Laps, Fastest Lap Time (in seconds) and Set Fastest Lap (binary), with Position as the target. The ones I am not using are: No, Driver, Team, Time/Retired, Points and Absolute Time. A driver's number doesn't affect their performance, and neither does their name or their team's. "Time/Retired", "Absolute Time" and the points all describe the outcome of the race, after the fact, so they are results rather than something that could be used for a prediction. Let's look into how the selected features relate to one another. I have used a heatmap and a scatter matrix.

In [26]:
plt.figure(figsize=(10, 6))
sns.heatmap(df_cleaned[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of Formula 1 Metrics")
plt.tight_layout()
plt.show()
In [27]:
sns.pairplot(df_cleaned, corner=False, plot_kws={'alpha': 0.6, 's': 40})
plt.suptitle("Scatter Matrix of Formula 1 Features", y=1.02)
plt.show()

Splitting into Train/Test¶

In [28]:
from sklearn.model_selection import train_test_split

X = df_cleaned[['Starting Grid', 'Laps', 'Fastest Lap Time (s)', 'Set Fastest Lap (binary)']]
y = df_cleaned['Position']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 431 observations, of which 344 are now in the train set, and 87 in the test set.

Scaling¶

In [29]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
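`StandardScaler` transforms each column with z = (x − μ) / σ, where μ and σ are the mean and (population) standard deviation learned from the training set. A quick check of that formula on a toy column (the values are illustrative, not the real features):

```python
import numpy as np

# Toy column standing in for one training feature.
x = np.array([1.0, 2.0, 3.0, 4.0])

# StandardScaler's transform: z = (x - mean) / std, with std using ddof=0.
z = (x - x.mean()) / x.std()

print(round(z.mean(), 10))  # 0.0
print(round(z.std(), 10))   # 1.0
```

This is why the scaler is fit on the training set only and then applied to the test set: the test data must be shifted and scaled by the training μ and σ, not its own.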

Modeling¶

For this project I will go through several model types to see which one works best: K-Nearest Neighbours, Linear Regression, Decision Trees, Support Vector Machines and Random Forest. After looking at each base model, we can enhance it with hyperparameter tuning or boosting with AdaBoost.

K-Nearest Neighbours:

In [30]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

y_train_clean = y_train.copy()
y_test_clean = y_test.copy()

knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train_scaled, y_train_clean)
y_pred_knn = knn_regressor.predict(X_test_scaled)
mae_knn = mean_absolute_error(y_test_clean, y_pred_knn)
mse_knn = mean_squared_error(y_test_clean, y_pred_knn)
r2_knn = r2_score(y_test_clean, y_pred_knn)

print(f"KNN Mean Absolute Error (MAE): {mae_knn}")
print(f"KNN Mean Squared Error (MSE): {mse_knn}")
print(f"KNN R-squared (R²): {r2_knn}")
KNN Mean Absolute Error (MAE): 3.2896551724137924
KNN Mean Squared Error (MSE): 20.878620689655172
KNN R-squared (R²): 0.30693945214851437

The R² here is around 30%, which can definitely be improved upon. Let's try doing so with hyperparameter tuning:

After hyperparameter tuning:

In [31]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    'p': [1, 2]
}

knn = KNeighborsRegressor()

grid_search = GridSearchCV(
    estimator=knn,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train_clean)
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test_scaled)

print("Best hyperparameters:", grid_search.best_params_)
print("Tuned MAE:", mean_absolute_error(y_test_clean, y_pred_best))
print("Tuned MSE:", mean_squared_error(y_test_clean, y_pred_best))
print("Tuned R²:", r2_score(y_test_clean, y_pred_best))
Best hyperparameters: {'n_neighbors': 9, 'p': 2, 'weights': 'uniform'}
Tuned MAE: 3.0434227330779047
Tuned MSE: 19.161770966368664
Tuned R²: 0.3639298456944432

The R² has now improved by about 6 percentage points, but that still isn't the level we are aiming for. Let's try some other models.

Linear Regression:

In [32]:
from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train_clean)
y_pred_linear = linear_model.predict(X_test_scaled)

mae_linear = mean_absolute_error(y_test_clean, y_pred_linear)
mse_linear = mean_squared_error(y_test_clean, y_pred_linear)
r2_linear = r2_score(y_test_clean, y_pred_linear)

print("Linear Regression Metrics:")
print(f"MAE: {mae_linear}")
print(f"MSE: {mse_linear}")
print(f"R²: {r2_linear}")
Linear Regression Metrics:
MAE: 2.8148606730396013
MSE: 17.51357331379357
R²: 0.418641351068321

This linear regression model has an R² of 41%, making it a better option than K-Nearest Neighbours. Can we build on it further? I will use AdaBoost to try for an even higher score:

In [33]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression

base_model = LinearRegression()

adaboost_model = AdaBoostRegressor(base_model, n_estimators=50, random_state=1)

adaboost_model.fit(X_train_scaled, y_train_clean)

y_pred_adaboost = adaboost_model.predict(X_test_scaled)

mae_adaboost = mean_absolute_error(y_test_clean, y_pred_adaboost)
mse_adaboost = mean_squared_error(y_test_clean, y_pred_adaboost)
r2_adaboost = r2_score(y_test_clean, y_pred_adaboost)

print("AdaBoost with Linear Regression Metrics:")
print(f"MAE: {mae_adaboost}")
print(f"MSE: {mse_adaboost}")
print(f"R²: {r2_adaboost}")
AdaBoost with Linear Regression Metrics:
MAE: 2.85748642122476
MSE: 16.71738732850535
R²: 0.4450705440383784

44% is the highest R² so far. Let's look at more model types:

Decision Trees (Regressor):

In [34]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

tree_model = DecisionTreeRegressor(random_state=1)
tree_model.fit(X_train_scaled, y_train_clean)

y_pred_tree = tree_model.predict(X_test_scaled)

mae_tree = mean_absolute_error(y_test_clean, y_pred_tree)
mse_tree = mean_squared_error(y_test_clean, y_pred_tree)
r2_tree = r2_score(y_test_clean, y_pred_tree)

print("Decision Tree Regressor Metrics:")
print(f"MAE: {mae_tree}")
print(f"MSE: {mse_tree}")
print(f"R²: {r2_tree}")
Decision Tree Regressor Metrics:
MAE: 3.781609195402299
MSE: 24.93103448275862
R²: 0.1724205983738124

With this model we get an R² of 17%, which is surprisingly low. The likely cause is overfitting: an unconstrained tree can memorize the training data. Let's try to fix that with some more hyperparameter tuning.

In [35]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    'max_depth': [3],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

tree = DecisionTreeRegressor(random_state=1)

grid_search = GridSearchCV(
    estimator=tree,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train_clean)
best_tree = grid_search.best_estimator_
y_pred_best_tree = best_tree.predict(X_test_scaled)

print("Best Parameters:", grid_search.best_params_)
print("Tuned MAE:", mean_absolute_error(y_test_clean, y_pred_best_tree))
print("Tuned MSE:", mean_squared_error(y_test_clean, y_pred_best_tree))
print("Tuned R²:", r2_score(y_test_clean, y_pred_best_tree))
Best Parameters: {'max_depth': 3, 'min_samples_leaf': 4, 'min_samples_split': 2}
Tuned MAE: 3.1079737085862322
Tuned MSE: 18.65229382431931
Tuned R²: 0.38084181092601066

We see a jump from 17% to 38%, which is quite significant. The base model was likely overfitting, which limiting max_depth prevents (in this case the optimal value is 3). Let's look at more:
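The overfitting effect behind that jump can be demonstrated on synthetic data (this is a stand-in for the race data, not the notebook's dataset): an unconstrained tree fits the training set perfectly but transfers its memorized noise to the test set, while a depth-limited tree generalizes better.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic noisy regression task: a linear signal buried in heavy noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = X[:, 0] + rng.normal(0, 2.0, size=200)

X_tr, X_te = X[:150], X[150:]
y_tr, y_te = y[:150], y[150:]

deep = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)             # no depth limit
shallow = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_tr, y_tr)

deep_train = r2_score(y_tr, deep.predict(X_tr))
deep_test = r2_score(y_te, deep.predict(X_te))
shallow_test = r2_score(y_te, shallow.predict(X_te))

print("deep train R²:", round(deep_train, 2))      # 1.0 — the tree memorized the noise
print("deep test R²:", round(deep_test, 2))
print("shallow test R²:", round(shallow_test, 2))  # higher than the deep tree's test R²
```

The same mechanism explains why `max_depth=3` helped the notebook's tree despite making it strictly less flexible.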

Support Vector Regressor:

In [36]:
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

svr_model = SVR()
svr_model.fit(X_train_scaled, y_train_clean)

y_pred_svr = svr_model.predict(X_test_scaled)

mae_svr = mean_absolute_error(y_test_clean, y_pred_svr)
mse_svr = mean_squared_error(y_test_clean, y_pred_svr)
r2_svr = r2_score(y_test_clean, y_pred_svr)

print("Support Vector Regressor Metrics:")
print(f"MAE: {mae_svr}")
print(f"MSE: {mse_svr}")
print(f"R²: {r2_svr}")
Support Vector Regressor Metrics:
MAE: 2.851083430866835
MSE: 17.191099331171962
R²: 0.42934579358804736

This one gives us a pretty solid result compared to the others; however, I believe we can go higher still.

Random forest:

In [37]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rf_model = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=42, n_jobs=-1)
rf_model.fit(X_train_scaled, y_train_clean)

y_pred_rf = rf_model.predict(X_test_scaled)

mae_rf = mean_absolute_error(y_test_clean, y_pred_rf)
mse_rf = mean_squared_error(y_test_clean, y_pred_rf)
r2_rf = r2_score(y_test_clean, y_pred_rf)

print("Random Forest Regressor Metrics:")
print(f"MAE: {mae_rf}")
print(f"MSE: {mse_rf}")
print(f"R²: {r2_rf}")
Random Forest Regressor Metrics:
MAE: 2.929655172413793
MSE: 17.43067816091954
R²: 0.42139303476041345

The Random Forest base model also gives us a result of 42%; let's try to enhance it to see how high we can get:

In [38]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}

rf_model = RandomForestRegressor(random_state=42, n_jobs=-1)

grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='neg_mean_absolute_error')

grid_search.fit(X_train_scaled, y_train_clean)

best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)

best_rf_model = grid_search.best_estimator_

y_pred_rf = best_rf_model.predict(X_test_scaled)

mae_rf = mean_absolute_error(y_test_clean, y_pred_rf)
mse_rf = mean_squared_error(y_test_clean, y_pred_rf)
r2_rf = r2_score(y_test_clean, y_pred_rf)

print("Random Forest Regressor Metrics with Hyperparameter Tuning:")
print(f"MAE: {mae_rf}")
print(f"MSE: {mse_rf}")
print(f"R²: {r2_rf}")
Best hyperparameters: {'bootstrap': True, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Random Forest Regressor Metrics with Hyperparameter Tuning:
MAE: 2.8414209708132074
MSE: 15.847555533117363
R²: 0.4739443910999773

After comparing the models, this last one gives the highest score, so we will be using it for our predictions.
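One detail of the prediction loop below is worth isolating: the regressor outputs continuous position estimates, and they are converted into a valid 1..N finishing order with pandas `rank(method='first')`, which breaks ties by order of appearance. A toy illustration (the numbers are made up):

```python
import pandas as pd

# Raw regressor outputs for four hypothetical drivers.
preds = pd.Series([3.7, 1.2, 3.7, 9.9])

# rank(method='first') assigns 1..N; tied values get ranks in row order.
order = preds.rank(method='first').astype(int)
print(list(order))  # [2, 1, 3, 4]
```

Without this step, two drivers could share a predicted position, which has no meaning in a race classification.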

Prediction for the Races¶

In [39]:
import numpy as np  # needed for np.nan below

post_race_results_df = pd.read_csv("RESULTS.csv")
starting_grid_df = pd.read_csv("STARTING_GRID.csv")

all_predictions = []
results_for_csv = []

for location in qualifying_df['Location'].unique():
    print(f"\n🏁 Predictions for {location}\n")

    race_df = qualifying_df[qualifying_df['Location'] == location].copy()

    starting_grids = []

    for _, row in race_df.iterrows():
        driver = row['Driver']
        match = starting_grid_df[
            (starting_grid_df['Driver'] == driver) &
            (starting_grid_df['Location'] == location)
        ]
        if not match.empty:
            starting_grids.append(int(match.iloc[0]['Pos']))
        else:
            starting_grids.append(np.nan)

    race_df['Starting Grid'] = starting_grids
    race_df[['Q1', 'Q2', 'Q3']] = race_df[['Q1', 'Q2', 'Q3']].replace('No Time', np.nan).astype(float)
    race_df['Fastest Lap Time (s)'] = race_df[['Q1', 'Q2', 'Q3']].min(axis=1)
    race_df['Laps'] = 58
    race_df['Set Fastest Lap (binary)'] = 0

    fastest_driver_idx = race_df['Fastest Lap Time (s)'].idxmin()
    race_df.loc[fastest_driver_idx, 'Set Fastest Lap (binary)'] = 1

    features = ['Starting Grid', 'Laps', 'Fastest Lap Time (s)', 'Set Fastest Lap (binary)']
    predictable_df = race_df.dropna(subset=features).copy()
    unpredictable_df = race_df[~race_df.index.isin(predictable_df.index)].copy()

    X_qual = predictable_df[features]
    X_qual_scaled = scaler.transform(X_qual)
    predicted_positions = best_rf_model.predict(X_qual_scaled)
    predictable_df['Predicted Position'] = predicted_positions
    predictable_df['Predicted Position'] = predictable_df['Predicted Position'].rank(method='first').astype(int)

    if not unpredictable_df.empty:
        start_rank = predictable_df['Predicted Position'].max() + 1
        unpredictable_df['Predicted Position'] = range(start_rank, start_rank + len(unpredictable_df))

    final_race_df = pd.concat([predictable_df, unpredictable_df], ignore_index=True)
    final_race_df = final_race_df.sort_values(by='Predicted Position').reset_index(drop=True)

    final_race_df['Predicted Position'] = final_race_df['Predicted Position'].astype(int)

    
    print(f"{'Driver':<25}{'Team':<20}{'Grid':<8}{'Predicted':<12}{'Actual':<10}{'Difference'}")

    for _, row in final_race_df.iterrows():
        driver = row['Driver']
        team = row['Team']
        grid = int(row['Starting Grid']) if pd.notna(row['Starting Grid']) else "N/A"
        predicted = row['Predicted Position']

        match = post_race_results_df[
            (post_race_results_df['Driver'] == driver) &
            (post_race_results_df['Location'] == location)
        ]
        actual = match.iloc[0]['Pos'] if not match.empty else "N/A"

        # Non-classified, disqualified or missing results can't be compared.
        if str(actual).upper() in ('NC', 'DQ', 'N/A'):
            difference = 'X'
        else:
            try:
                actual = int(actual)
                predicted = int(row['Predicted Position'])
                if actual == predicted:
                    difference = '✔'
                elif actual < predicted:
                    difference = f"+{predicted - actual}"
                else:
                    difference = f"-{actual - predicted}"
            except (ValueError, TypeError):
                difference = 'X'
                actual = 'Error'


        results_for_csv.append({
            'Driver': driver,
            'Team': team,
            'Grid': grid,
            'Predicted': predicted,
            'Actual': actual,
            'Difference': difference,
            'Location': location
        })
        print(f"{driver:<25}{team:<20}{grid:<8}{predicted:<12}{actual:<10}{difference}")

    print("-" * 50)

    all_predictions.append(final_race_df)

# Save the combined predictions and the model once, after all races are processed.
results_df = pd.DataFrame(results_for_csv)
results_df.to_csv("website/data/final_individual_rankings.csv", index=False)

import joblib
joblib.dump(best_rf_model, "website/model/Individual_Predictions.pkl")
🏁 Predictions for Australia

Driver                   Team                Grid    Predicted   Actual    Difference
Oscar Piastri            McLaren             2       1           9         -8
Lando Norris             McLaren             1       2           1         +1
Max Verstappen           Red Bull Racing     3       3           2         +1
George Russell           Mercedes            4       4           3         +1
Yuki Tsunoda             Racing Bulls        5       5           12        -7
Alexander Albon          Williams            6       6           5         +1
Pierre Gasly             Alpine              9       7           11        -4
Charles Leclerc          Ferrari             7       8           8         ✔
Lewis Hamilton           Ferrari             8       9           10        -1
Carlos Sainz Jr          Williams            10      10          NC        X
Isack Hadjar             Racing Bulls        11      11          NC        X
Fernando Alonso          Aston Martin        12      12          NC        X
Lance Stroll             Aston Martin        13      13          6         +7
Liam Lawson              Red Bull Racing     18      14          NC        X
Esteban Ocon             Haas                19      15          13        +2
Gabriel Bortoleto        Kick Sauber         15      16          NC        X
Jack Doohan              Alpine              14      17          NC        X
Andrea Kimi Antonelli    Mercedes            16      18          4         +14
Nico Hulkenberg          Kick Sauber         17      19          7         +12
Oliver Bearman           Haas                20      20          14        +6
--------------------------------------------------

🏁 Predictions for China

Driver                   Team                Grid    Predicted   Actual    Difference
Oscar Piastri            McLaren             1       1           1         ✔
Lewis Hamilton           Ferrari             5       2           DQ        X
Lando Norris             McLaren             3       3           2         +1
Max Verstappen           Red Bull Racing     4       4           4         ✔
George Russell           Mercedes            2       5           3         +2
Isack Hadjar             Racing Bulls        7       6           11        -5
Andrea Kimi Antonelli    Mercedes            8       7           6         +1
Charles Leclerc          Ferrari             6       8           DQ        X
Yuki Tsunoda             Racing Bulls        9       9           16        -7
Alexander Albon          Williams            10      10          7         +3
Gabriel Bortoleto        Kick Sauber         19      11          14        -3
Carlos Sainz Jr          Williams            15      12          10        +2
Jack Doohan              Alpine              18      13          13        ✔
Fernando Alonso          Aston Martin        13      14          NC        X
Oliver Bearman           Haas                17      15          8         +7
Pierre Gasly             Alpine              16      16          DQ        X
Liam Lawson              Red Bull Racing     20      17          12        +5
Nico Hulkenberg          Kick Sauber         12      18          15        +3
Lance Stroll             Aston Martin        14      19          9         +10
Esteban Ocon             Haas                11      20          5         +15
--------------------------------------------------

🏁 Predictions for Japan

Driver                   Team                Grid    Predicted   Actual    Difference
Max Verstappen           Red Bull Racing     1       1           1         ✔
George Russell           Mercedes            5       2           5         -3
Oscar Piastri            McLaren             3       3           3         ✔
Charles Leclerc          Ferrari             4       4           4         ✔
Lando Norris             McLaren             2       5           2         +3
Isack Hadjar             Racing Bulls        7       6           8         -2
Lewis Hamilton           Ferrari             8       7           7         ✔
Andrea Kimi Antonelli    Mercedes            6       8           6         +2
Alexander Albon          Williams            9       9           9         ✔
Jack Doohan              Alpine              19      10          15        -5
Esteban Ocon             Haas                18      11          18        -7
Oliver Bearman           Haas                10      12          10        +2
Lance Stroll             Aston Martin        20      13          20        -7
Carlos Sainz Jr          Williams            15      14          14        ✔
Gabriel Bortoleto        Kick Sauber         17      15          19        -4
Nico Hulkenberg          Kick Sauber         16      16          16        ✔
Liam Lawson              Racing Bulls        13      17          17        ✔
Yuki Tsunoda             Red Bull Racing     14      18          12        +6
Fernando Alonso          Aston Martin        12      19          11        +8
Pierre Gasly             Alpine              11      20          13        +7
--------------------------------------------------

🏁 Predictions for Bahrain

Driver                   Team                Grid    Predicted   Actual    Difference
Oscar Piastri            McLaren             1       1           1         ✔
Andrea Kimi Antonelli    Mercedes            5       2           11        -9
George Russell           Mercedes            3       3           2         +1
Pierre Gasly             Alpine              4       4           7         -3
Charles Leclerc          Ferrari             2       5           4         +1
Lando Norris             McLaren             6       6           3         +3
Max Verstappen           Red Bull Racing     7       7           6         +1
Carlos Sainz Jr          Williams            8       8           NC        X
Lewis Hamilton           Ferrari             9       9           5         +4
Yuki Tsunoda             Red Bull Racing     10      10          9         +1
Lance Stroll             Aston Martin        19      11          17        -6
Alexander Albon          Williams            15      12          12        ✔
Gabriel Bortoleto        Kick Sauber         18      13          18        -5
Liam Lawson              Racing Bulls        17      14          16        -2
Nico Hulkenberg          Kick Sauber         16      15          DQ        X
Oliver Bearman           Haas                20      16          10        +6
Fernando Alonso          Aston Martin        13      17          15        +2
Isack Hadjar             Racing Bulls        12      18          13        +5
Esteban Ocon             Haas                14      19          8         +11
Jack Doohan              Alpine              11      20          14        +6
--------------------------------------------------

🏁 Predictions for Saudi Arabia

Driver                   Team                Grid    Predicted   Actual    Difference
Max Verstappen           Red Bull Racing     1       1           2         -1
Andrea Kimi Antonelli    Mercedes            5       2           6         -4
George Russell           Mercedes            3       3           5         -2
Oscar Piastri            McLaren             2       4           1         +3
Charles Leclerc          Ferrari             4       5           3         +2
Lewis Hamilton           Ferrari             7       6           7         -1
Yuki Tsunoda             Red Bull Racing     8       7           NC        X
Carlos Sainz Jr          Williams            6       8           8         ✔
Pierre Gasly             Alpine              9       9           NC        X
Esteban Ocon             Haas                19      10          14        -4
Lando Norris             McLaren             10      11          4         +7
Nico Hulkenberg          Kick Sauber         18      12          15        -3
Gabriel Bortoleto        Kick Sauber         20      13          18        -5
Oliver Bearman           Haas                15      14          13        +1
Jack Doohan              Alpine              17      15          17        -2
Lance Stroll             Aston Martin        16      16          16        ✔
Fernando Alonso          Aston Martin        13      17          11        +6
Isack Hadjar             Racing Bulls        14      18          10        +8
Liam Lawson              Racing Bulls        12      19          12        +7
Alexander Albon          Williams            11      20          9         +11
--------------------------------------------------

🏁 Predictions for Miami

Driver                   Team                Grid    Predicted   Actual    Difference
Max Verstappen           Red Bull Racing     1       1           4         -3
George Russell           Mercedes            5       2           3         -1
Andrea Kimi Antonelli    Mercedes            3       3           6         -3
Oscar Piastri            McLaren             4       4           1         +3
Lando Norris             McLaren             2       5           2         +3
Alexander Albon          Williams            7       6           5         +1
Charles Leclerc          Ferrari             8       7           7         ✔
Oliver Bearman           Haas                19      8           NC        X
Carlos Sainz Jr          Williams            6       9           9         ✔
Esteban Ocon             Haas                9       10          12        -2
Lance Stroll             Aston Martin        18      11          16        -5
Yuki Tsunoda             Red Bull Racing     10      12          10        +2
Pierre Gasly             Alpine              20      13          13        ✔
Fernando Alonso          Aston Martin        17      14          15        -1
Nico Hulkenberg          Kick Sauber         16      15          14        +1
Liam Lawson              Racing Bulls        15      16          NC        X
Isack Hadjar             Racing Bulls        11      17          11        +6
Gabriel Bortoleto        Kick Sauber         13      18          NC        X
Lewis Hamilton           Ferrari             12      19          8         +11
Jack Doohan              Alpine              14      20          NC        X
--------------------------------------------------

Team Results¶

In Formula 1, points are awarded based on finishing position in the race, with different point values for the top 10 finishers. Finishers outside the top 10 aren't awarded any points. The current F1 points distribution is as follows:

1st place: 25 points

2nd place: 18 points

3rd place: 15 points

4th place: 12 points

5th place: 10 points

6th place: 8 points

7th place: 6 points

8th place: 4 points

9th place: 2 points

10th place: 1 point

In [40]:
import pandas as pd

real_results_df = pd.read_csv("TEAM_RESULTS_2025.csv", header=None, names=["Real Team Position", "Team", "Real Points"])

points_dict = {
    1: 25, 2: 18, 3: 15, 4: 12, 5: 10,
    6: 8, 7: 6, 8: 4, 9: 2, 10: 1
}

def get_points(pos):
    return points_dict.get(pos, 0)

combined_df = pd.concat(all_predictions, ignore_index=True)
combined_df['Predicted Points'] = combined_df['Predicted Position'].apply(get_points)

total_team_points = combined_df.groupby('Team', as_index=False)['Predicted Points'].sum()
total_team_points['Team Rank'] = total_team_points['Predicted Points'].rank(method='first', ascending=False).astype(int)
total_team_points = total_team_points.sort_values('Team Rank')

merged_df = pd.merge(total_team_points, real_results_df, on="Team", how="left")

merged_df['Real Team Position'] = pd.to_numeric(merged_df['Real Team Position'], errors='coerce')
merged_df['Team Rank'] = pd.to_numeric(merged_df['Team Rank'], errors='coerce')

if merged_df['Real Team Position'].isna().any() or merged_df['Team Rank'].isna().any():
    print("Warning: NaN values found in 'Real Team Position' or 'Team Rank'.")
    print(merged_df[merged_df['Real Team Position'].isna() | merged_df['Team Rank'].isna()])

def get_accuracy(row):
    try:
        actual = row['Real Team Position']
        predicted = row['Team Rank']

        if pd.isna(actual) or pd.isna(predicted):  
            return 'X'
        
        if actual == predicted:
            return '✔'
        elif actual < predicted:
            return f"+{predicted - actual}"
        else:
            return f"-{actual - predicted}"
    except Exception as e:
        print(f"Error in accuracy calculation: {e}")
        return 'X'

merged_df['Accuracy'] = merged_df.apply(get_accuracy, axis=1)

merged_df = merged_df[['Team', 'Predicted Points', 'Team Rank', 'Real Points', 'Real Team Position', 'Accuracy']]
merged_df.to_csv("website/data/final_team_rankings.csv", index=False)
merged_df.head(15)
Out[40]:
Team Predicted Points Team Rank Real Points Real Team Position Accuracy
0 McLaren 175 1 246 1 ✔
1 Mercedes 149 2 141 2 ✔
2 Red Bull Racing 115 3 105 3 ✔
3 Ferrari 82 4 94 4 ✔
4 Williams 30 5 37 5 ✔
5 Racing Bulls 28 6 8 8 -2
6 Alpine 21 7 7 9 -2
7 Haas 6 8 20 6 +2
8 Aston Martin 0 9 14 7 +2
9 Kick Sauber 0 10 6 10 ✔

Final Conclusions¶

Data Analysis

In this project we looked at the data from the 2024 season and analysed it with the help of graphs. We examined driver and team performances from last season to see whether there are favourites for the season we are trying to predict. We looked at the tracks with the most NCs and the average gap between drivers on each track to get a better idea of which tracks might prove easier or more challenging, and at each driver's average time to see how close the competition in Formula 1 really is. Having studied the drivers and the tracks, we then examined each driver's performance on each track and the relationship between starting and finishing position across the grid. The visualisation part of this notebook matters because it lets us analyse the data before we get into the modeling.

Modeling

We looked into different types of models: K-Nearest Neighbours, Linear Regression, Decision Trees, Support Vector Regressor and Random Forest. We applied boosting with AdaBoost or hyperparameter tuning to the models to maximize their performance, and eventually settled on the tuned Random Forest. After making the predictions, we can compare them with the real-life outcomes either in the notebook or in the website dashboard.

Domain Analysis

After completing the modeling stage, it's important to reflect on the broader implications of using AI in a high-stakes, data-driven sport like Formula 1. While our prediction model serves as a useful prototype to understand patterns in team and driver performance, it also raises an important question: Should we blindly trust AI? The short answer is no — especially not in isolation.

AI can provide valuable insights, uncover trends, and assist with data-heavy decisions, but it lacks the nuance, intuition, and contextual awareness that human strategists bring to the table. In practice, especially for an F1 team, AI should be seen as an advisor, not a decision-maker. It can help simulate outcomes, evaluate probabilities, and reduce human error in some cases — but final calls should always involve expert judgment, especially given the unpredictable nature of racing.

Looking forward, the role of AI in Formula 1 will almost certainly grow. As data collection improves, models will become more and more accurate, possibly even predictive on a per-lap basis. In the context of our project, the model has shown potential and offers a foundation for future development. With more detailed data and advanced features, it could become a practical tool for analysts, teams, and fans alike. But as far as my honest advice goes? Treat AI as a co-pilot, not the driver.